APPROXIMATION ALGORITHMS FOR NONBINARY
AGREEMENT FORESTS 

LEO VAN IERSEL, STEVEN KELK, NELA LEKIC, LEEN STOUGIE* 

Abstract. Given two rooted phylogenetic trees on the same set of taxa X, the Maximum Agreement Forest 
problem (MAF) asks to find a forest that is, in a certain sense, common to both trees and has a minimum number 
of components. The Maximum Acyclic Agreement Forest problem (MAAF) has the additional restriction that the 
components of the forest cannot have conflicting ancestral relations in the input trees. There has been considerable 
interest in the special cases of these problems in which the input trees are required to be binary. However, in 
practice, phylogenetic trees are rarely binary, due to uncertainty about the precise order of speciation events. Here, 
we show that the general, nonbinary version of MAF has a polynomial-time 4-approximation and a fixed-parameter 
tractable (exact) algorithm that runs in $O(4^k \mathrm{poly}(n))$ time, where $n = |X|$ and $k$ is the number of components
of the agreement forest minus one. Moreover, we show that a c-approximation algorithm for nonbinary MAF and 
a d-approximation algorithm for the classical problem Directed Feedback Vertex Set (DFVS) can be combined to 
yield a d(c + 3)-approximation for nonbinary MAAF. The algorithms for MAF have been implemented and made
publicly available.



1. Introduction. Background. A rooted phylogenetic tree is a rooted tree with its leaves 
labelled by species, or strains of species, more abstractly known as taxa. Arcs, also called edges, are 
directed away from the root towards the leaves and internal vertices of the tree have indegree 1. 
It is a model used to exhibit ancestral relations between the species. For an introduction to 
phylogenetic trees see [15], [TU [29].



Occasionally it happens that on the same set of species topologically distinct phylogenetic trees are 
derived from different data sources (e.g. different genes). Such partly incompatible trees may arise 
due to biological reticulation events such as hybridization, recombination or lateral gene transfer 
p~8l [23] [24]. These events cannot be explained in a tree-like ancestral relation model. There are
additionally many non-reticulate biological phenomena, such as incomplete lineage sorting, that
can likewise lead to conflicting tree signals [23, 24]. Whatever the cause of the conflict, it is natural
to wish to quantify the dissimilarity of two phylogenetic trees. 


rSPR distance and Maximum Agreement Forests. One such measure of dissimilarity is the rooted 
Subtree Prune and Regraft (rSPR) distance, which asks for the minimum number of subtrees that 
need to be successively detached and re-attached in one of the trees to transform it into the other. 
The search for an alternative characterisation of rSPR distance was a major motivation behind
the study of the agreement forest problem that we now describe; a formal definition follows in the
next section. We are given two rooted trees with the leaves labeled by the elements of a set X 
and no vertices with indegree and outdegree both equal to 1. An agreement forest is a partition of 
X such that (a) in both trees the partition induces edge-disjoint subtrees and (b) for each block
("component") of the partition, the two subtrees induced are phylogenetically compatible, i.e. have
a common refinement; see Figure 1.1. The maximum agreement forest problem (MAF) is to find
an agreement forest with a minimum number of components.

The binary MAF problem, in which the two input phylogenetic trees are binary, was introduced
by Hein et al. [17]. Bordewich and Semple [B] showed that, modulo a minor rooting technicality, it
is equivalent to computing the rSPR distance. The first correct polynomial-time approximation
algorithm for this problem was the 5-approximation algorithm by Bonet et al. [3]. The first
3-approximation was given in [5], a quadratic-time 3-approximation in [27] and a linear-time
3-approximation in [30]. A fixed-parameter tractable (FPT) algorithm for binary MAF was first
given by Bordewich and Semple [5], the running time of which was subsequently improved in [5]
to $O(4^k k^4 + n^3)$ and in [50] to $O(2.42^k n)$. (See QH [55] for an introduction to fixed-parameter
tractability.) For more than two binary trees, there exists an 8-approximation [5].

However, in applied phylogenetics it rarely happens that trees are binary. Often there is insufficient 
information available to be able to determine the exact order in which several branching events 
occurred, and it is standard practice to model this uncertainty using vertices with outdegree 
higher than two: a soft polytomy [23]. Hence it is extremely important to develop algorithms
for the nonbinary case, i.e. when the trees are not necessarily binary and high-degree vertices
capture a set of equally likely branching scenarios. The close relationship between rSPR distance 
and MAF also holds in the nonbinary case [11]. However, compared to the binary case neither 
problem has been well-studied. Nonbinary agreement forests were introduced in [27], where a
(d + 1)-approximation algorithm for nonbinary MAF was given, with d the maximum outdegree
of the input trees. A kernelization was presented in [11], showing that nonbinary MAF is
fixed-parameter tractable. Combining this kernelization of size 64k with the trivial $O(n^k \mathrm{poly}(n))$ time
exact algorithm gives an $O((64k)^k \mathrm{poly}(n))$ time FPT algorithm, with n the number of species and k
the size of the maximum agreement forest minus one. In addition, the kernel automatically implies
the existence of a polynomial-time 64-approximation algorithm (since any instance trivially has a
solution of size n).

In this article we give a polynomial-time 4-approximation algorithm and an $O(4^k \mathrm{poly}(n))$ time
FPT algorithm for nonbinary MAF. Both algorithms have been implemented and made publicly 
available. Although these results are of interest in their own right, we also show how to utilize 
them to obtain improved algorithms for computation of hybridization number. 

Hybridization number and Maximum Acyclic Agreement Forests. The other problem we study is 
a variation of MAF in which the roots of the subtrees in the agreement forest are not allowed to 
have conflicting ancestral relations in the two input phylogenetic trees. For an example, consider 
the agreement forest in Figure 1.1. In the first phylogenetic tree, the subtree with leaves c and d
is "above" the subtree with leaves a and b, whereas it is the other way around in the second
phylogenetic tree. By saying that a subtree is "above" another subtree, we mean that there exists a
directed path from the root of the first to the root of the second subtree that contains at least one
edge of the first subtree. Hence, the agreement forest in Figure 1.1 is, in this sense, cyclic. Such
cyclic phenomena are prohibited in the Maximum Acyclic Agreement Forest problem (MAAF),
introduced in [1]. The main reason for studying MAAF is its close connection with the hybridization
number problem (HN). A hybridization network (often also called a rooted phylogenetic network) 
is a rooted phylogenetic tree that additionally may contain reticulations, vertices with indegree 
two or higher. Given two rooted trees on the same set of taxa, the HN problem asks
for a hybridization network with a minimum number of reticulations that displays (i.e. contains 
embeddings of) both the input trees. The HN problem can thus be viewed as constructing a 
"most parsimonious" explicit evolutionary hypothesis to explain the dissimilarity between the two 
input trees; the problem first gained attention following the publication of several seminal articles 
in 2004-5 e.g. [T] [5] . The number of reticulations in an optimal solution to the HN problem is 
exactly one less than the number of components in an optimal solution to the MAAF problem [1] , 
making the problems essentially equivalent. 

Fig. 1.1. Two nonbinary rooted phylogenetic trees $T_1$ and $T_2$ and a maximum agreement forest $\mathcal{F}$ of $T_1$ and $T_2$.

The problem MAAF is NP-hard and APX-hard [5], although there do exist efficient FPT
algorithms for the binary variant of the problem, e.g. [H [TU1 HH HH1 ISO]. The nonbinary variant
of the problem has received comparatively little attention, although that too is FPT [22], [26] and
some algorithms have been implemented (see [26] for a discussion). In both cases, the practical
applicability of the FPT algorithms is limited to instances of small or moderate size [TH]; for larger
instances approximation algorithms are required. In that regard, it was recently shown in [21] that,
from an approximation perspective, binary MAAF is a very close relative of the classical problem
directed feedback vertex set (DFVS), whose definition is presented in Section 2. Unfortunately, it is
still a major open problem in approximation complexity whether DFVS permits a constant-factor 
polynomial-time approximation algorithm. The nonbinary variant of MAAF is of course at least 
as hard to approximate as the binary variant and is thus at least as hard to approximate as DFVS. 
Hence, for large instances of MAAF, neither FPT algorithms nor polynomial-time approximation 
algorithms are, at the present time, appropriate. In [19] we showed how, for the binary variant of 
MAAF, large instances can however be very well approximated using a specific marriage of MAF 
and DFVS solvers. This approach is exponential-time in the worst case but in practice is very fast 
and yields highly competitive approximation factors. The question remained whether a similar 
approach would work for nonbinary MAAF. 

In this paper, we show that a c-approximation algorithm for nonbinary MAF and a d-approximation 
algorithm for DFVS can be combined to yield a d(c + 3)-approximation for nonbinary MAAF.
Combining this with our aforementioned polynomial-time 4-approximation for MAF, we obtain a
7d-approximation for MAAF. As we discuss in the conclusion, it is likely that in practice d = 1 can
be obtained using ILP to solve the generated DFVS instances. Hence, using the 4-approximation
for nonbinary MAF we get an exponential-time 7-approximation for nonbinary MAAF and, using
the FPT algorithm for nonbinary MAF, we get an exponential-time 4-approximation for
nonbinary MAAF. If we wish a purely polynomial-time approximation, we can use the best-known
polynomial-time approximation algorithm for DFVS, yielding overall a polynomial-time
$O(\log(k) \log(\log(k)))$-approximation for nonbinary MAAF, with k the size of a maximum acyclic
agreement forest minus one. We mention here that our algorithms for MAF are partly based on
the ideas in [30] and the algorithm for MAAF on the ideas in [TH] but the analysis is different in
both cases because nonbinary agreement forests are substantially different from binary agreement 
forests. In fact, high-outdegree vertices pose a formidable challenge because of the freedom to 
refine these vertices in one of exponentially many ways. 

An implementation of both our MAF algorithms in Java has been made publicly available [20] . 



2. Preliminaries and statement of results. Let X be a finite set (e.g. of species). A
rooted phylogenetic X-tree is a rooted tree with no vertices with indegree 1 and outdegree 1, a
root with indegree 0 and outdegree at least 2, and leaves bijectively labelled by the elements of X.
We identify each leaf with its label. We henceforth call a rooted phylogenetic X-tree a tree for
short. Note that we do not restrict phylogenetic trees to be binary. A tree T is a refinement of
a tree T' if T' can be obtained from T by contracting edges. For a tree T and a set $X' \subseteq X$,
we define T(X') as the minimal subtree of T that contains all elements of X', and T|X' as the
result of suppressing all vertices with in- and outdegree 1 of T(X'). The set of leaves of a tree T is
denoted L(T). We say that tree T' is displayed by tree T if T' can be obtained from a subtree of T
by contracting edges. If T' is displayed by T, then T(L(T')) is the embedding of T' in T.
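
The restriction T|X' can be illustrated with a short sketch. The nested-tuple encoding of trees and the function name `restrict` are assumptions made for this illustration only; they are not taken from the authors' implementation.

```python
def restrict(tree, keep):
    """Return T|keep for a tree encoded as nested tuples, or None if no leaf survives.

    Assumed encoding: a leaf is a string label, an internal vertex is a tuple
    of its child subtrees.
    """
    if isinstance(tree, str):                      # leaf
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None                                # nothing of `keep` below this vertex
    if len(kept) == 1:
        return kept[0]                             # suppress a vertex with a single child
    return tuple(kept)


T = (("a", "b"), ("c", "d", "e"))
print(restrict(T, {"a", "c", "d"}))                # ('a', ('c', 'd'))
```

Dropping subtrees without kept leaves yields the minimal subtree T(X'), and returning the single surviving child performs the suppression step.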



Throughout the paper, we usually refer to directed edges (arcs) simply as edges and to directed
paths simply as paths. If e = (u, v) is an edge of some tree, then we say that v is a child of u,
that u is the parent of v, that u is the tail of e, that v is the head of e and we write tail(e) = u
and head(e) = v.

A forest is defined as a set of trees. To avoid confusion, we call each element of a forest a component, 
rather than a tree. Let T be a tree and $\mathcal{F}$ a forest. We say that $\mathcal{F}$ is a forest for T if:

• each component $F \in \mathcal{F}$ is a refinement of $T|L(F)$;

• the subtrees $\{T(L(F)) \mid F \in \mathcal{F}\}$ are edge-disjoint; and

• the union of $L(F)$ over all $F \in \mathcal{F}$ is equal to $L(T)$.

By this definition, if $\mathcal{F}$ is a forest for some tree T, then $\{L(F) \mid F \in \mathcal{F}\}$ is a partition of the leaf
set of T. It will indeed sometimes be useful to see an agreement forest as a partition of the leaves,
and sometimes to see it as a collection of trees.

If $T_1$ and $T_2$ are two trees, then a forest $\mathcal{F}$ is an agreement forest of $T_1$ and $T_2$ if it is a forest
for $T_1$ and a forest for $T_2$. The size of a forest $\mathcal{F}$, denoted by $|\mathcal{F}|$, is defined as the number of its
components. We consider the following computational problem. 

Problem: Nonbinary Maximum Agreement Forest (Nonbinary MAF) 
Instance: Two rooted phylogenetic trees $T_1$ and $T_2$.
Solution: An agreement forest $\mathcal{F}$ of $T_1$ and $T_2$.
Objective: Minimize $|\mathcal{F}| - 1$.

We define the inheritance graph $IG(T_1, T_2, \mathcal{F})$ of an agreement forest $\mathcal{F}$ as the directed graph whose
vertices are the components of $\mathcal{F}$ and which has an edge $(F, F')$ precisely if either

• there exists a path in $T_1$ from the root of $T_1(L(F))$ to the root of $T_1(L(F'))$ containing
an edge of $T_1(L(F))$; or

• there exists a path in $T_2$ from the root of $T_2(L(F))$ to the root of $T_2(L(F'))$ containing
an edge of $T_2(L(F))$.

Note that if there exists a path in $T_i$ (for $i \in \{1,2\}$) from the root of $T_i(L(F))$ to the root of
$T_i(L(F'))$ containing an edge of $T_i(L(F))$, then this directly implies that this path also contains
such an edge that has the root of $T_i(L(F))$ as tail.

An agreement forest $\mathcal{F}$ of $T_1$ and $T_2$ is called an acyclic agreement forest if the graph $IG(T_1, T_2, \mathcal{F})$
is acyclic.
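
Once the inheritance graph has been computed, testing whether an agreement forest is acyclic is ordinary cycle detection. The sketch below assumes the graph is already available as a vertex list and an arc list (computing the arcs from the trees is not shown), and uses Kahn's algorithm; the component names are hypothetical.

```python
from collections import defaultdict, deque


def is_acyclic(vertices, edges):
    """Kahn's algorithm: True iff the directed graph has no directed cycle."""
    indeg = {v: 0 for v in vertices}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in indeg if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed == len(indeg)        # every vertex peeled off means no cycle


# Two components, each "above" the other in one of the two trees: a 2-cycle.
print(is_acyclic(["ab", "cd"], [("cd", "ab"), ("ab", "cd")]))   # False
```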

We call a forest an $\mathcal{F}$-splitting if it is an acyclic agreement forest that can be obtained from a
refinement of $\mathcal{F}$ by removing edges and suppressing vertices with in- and outdegree 1.

A maximum acyclic agreement forest (MAAF) of $T_1$ and $T_2$ is an acyclic agreement forest of $T_1$
and $T_2$ with a minimum number of components.

Problem: Nonbinary Maximum Acyclic Agreement Forest (Nonbinary MAAF)
Instance: Two rooted phylogenetic trees $T_1$ and $T_2$.
Solution: An acyclic agreement forest $\mathcal{F}$ of $T_1$ and $T_2$.
Objective: Minimize $|\mathcal{F}| - 1$.

We use the notation $\mathrm{MAF}(T_1,T_2)$ and $\mathrm{MAAF}(T_1,T_2)$ for the optimal objective value of,
respectively, a maximum agreement forest and a maximum acyclic agreement forest for input trees $T_1$
and $T_2$. Hence, if $\mathcal{A}$ is a maximum agreement forest and $\mathcal{M}$ is a maximum acyclic agreement
forest, then $\mathrm{MAF}(T_1,T_2) = |\mathcal{A}| - 1$ and $\mathrm{MAAF}(T_1,T_2) = |\mathcal{M}| - 1$.

We note that in the MAF and MAAF literature it is commonplace to assume that there is a leaf
labelled $\rho$ attached to the root of each input tree: see [3T] and earlier articles. In this article we
omit $\rho$. The extra leaf $\rho$ was originally introduced into the binary MAF literature to ensure that
optimum solutions to MAF correctly correspond to optimum solutions to the rSPR problem [5].
If one defines MAF, as we do, without $\rho$, then we can easily simulate the involvement of $\rho$ by
introducing a new taxon $x' \notin X$ which is a child of the root in both the input trees. The MAAF
literature grew out of the MAF literature and thus inherited the use of $\rho$. However, for MAAF $\rho$
is not necessary at all, and this is tightly linked to the acyclic character of the inheritance graph.
Although we omit the proof, it is easy to show that if we have a solution to MAAF in which $\rho$
appears on its own in an isolated component, we can obtain a better solution by grafting $\rho$ onto
any of the components C that have indegree 0 in the inheritance graph. This can create new
outgoing edges from C in the inheritance graph, but it cannot create new incoming edges, and
hence preserves the acyclicity of the graph. For both these reasons (the ease with which $\rho$ can
be simulated in the case of MAF, and its redundancy in the case of MAAF) we choose to omit $\rho$
from this paper.

The MAAF problem is closely related to an important problem in phylogenetics. A rooted phylogenetic
network is a directed acyclic graph with no vertices with indegree 1 and outdegree 1 and
leaves bijectively labelled by the elements of X. Rooted phylogenetic networks will henceforth be
called networks for short in this paper. A tree T is displayed by a network N if T can be obtained
from a subtree of N by contracting edges. Using $\delta^-(v)$ to denote the indegree of a vertex $v$,
a reticulation is a vertex $v$ with $\delta^-(v) \geq 2$. The reticulation number of a network N is given
by

$$r(N) = \sum_{v \,:\, \delta^-(v) \geq 2} \left(\delta^-(v) - 1\right).$$
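
As a small illustration of the reticulation number, the sketch below sums indegree minus one over all vertices of indegree at least two. The arc-list encoding and the toy network are assumptions made for this example.

```python
from collections import Counter


def reticulation_number(edges):
    """r(N): sum of (indegree - 1) over all vertices with indegree at least 2.

    `edges` is an iterable of (tail, head) pairs describing the network.
    """
    indegree = Counter(head for _tail, head in edges)
    return sum(d - 1 for d in indegree.values() if d >= 2)


# Hypothetical network: vertex h has the two parents u and v, so r(N) = 1.
network = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "b"), ("h", "c")]
print(reticulation_number(network))   # 1
```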

In pQ it was shown that, in the binary case, the optimum to MAAF is equal to the optimum of the 
following problem. Later, in [22], this characterisation was extended to nonbinary trees. 

Problem: Nonbinary Minimum Hybridization (Nonbinary MH) 
Instance: Two rooted phylogenetic trees $T_1$ and $T_2$.
Solution: A phylogenetic network $N$ that displays $T_1$ and $T_2$.
Objective: Minimize $r(N)$.

Moreover, it was shown that, for two trees $T_1, T_2$, any acyclic agreement forest for $T_1$ and $T_2$
with $k + 1$ components can be turned into a phylogenetic network that displays $T_1$ and $T_2$ and
has reticulation number $k$, and vice versa. Thus, any approximation for Nonbinary MAAF gives
an approximation for Nonbinary MH.

There is also a less obvious relation between MAAF and the directed feedback vertex set problem, a 
problem which is well-known in the communities of theoretical computer science and combinatorial 
optimisation. A feedback vertex set of a directed graph is a subset of the vertices that contains at 
least one vertex of each directed cycle. Equivalently, a subset of the vertices of a directed graph 
is a feedback vertex set if removing these vertices from the graph makes it acyclic. 
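
The second characterisation translates directly into a check: delete the candidate set and test whether the remaining graph is acyclic. The following is a minimal sketch, assuming the graph is given as a vertex list and an arc list; it verifies a candidate set and does not search for a minimum one.

```python
def is_feedback_vertex_set(vertices, edges, S):
    """True iff deleting the vertices in S leaves a graph without directed cycles."""
    remaining = set(vertices) - set(S)
    arcs = [(u, v) for u, v in edges if u in remaining and v in remaining]
    # Repeatedly delete vertices without incoming arcs; in an acyclic graph
    # such a vertex always exists, so getting stuck means a cycle survives.
    while remaining:
        heads = {v for _u, v in arcs}
        sources = remaining - heads
        if not sources:
            return False
        remaining -= sources
        arcs = [(u, v) for u, v in arcs if u in remaining and v in remaining]
    return True


cycle_plus_tail = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(is_feedback_vertex_set([1, 2, 3, 4], cycle_plus_tail, {2}))   # True
print(is_feedback_vertex_set([1, 2, 3, 4], cycle_plus_tail, {4}))   # False
```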

Problem: Directed Feedback Vertex Set (DFVS) 
Instance: A directed graph D. 
Solution: A feedback vertex set S of D. 
Objective: Minimize $|S|$.

In the next two sections we will prove the following theorems. In Section 3, we relate the
approximability of Nonbinary MAAF to that of Nonbinary MAF and DFVS.



Theorem 2.1. If there exists a c-approximation for Nonbinary MAF and a d-approximation for
DFVS, then there exists a d(c + 3)-approximation for Nonbinary MAAF and hence for Nonbinary
MH.

In Section 4, we design a polynomial-time approximation algorithm for (Nonbinary) MAF and 
prove constant factor approximability. 

Theorem 2.2. There is a polynomial-time 4-approximation for (nonbinary) MAF.



Moreover, the proof of Theorem 2.2 almost directly leads to fixed-parameter tractability. 

Theorem 2.3. Nonbinary MAF can be solved exactly in $O(4^k \mathrm{poly}(n))$ time, with n the number
of leaves and k the number of components of a maximum agreement forest minus one.

Combining Theorems 2.1 and 2.2 above, we obtain the following corollary.



Corollary 2.4. If there exists a d-approximation for DFVS then there exists a 7d-approximation
for Nonbinary MAAF and hence for Nonbinary MH.
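
For concreteness, the factor in Corollary 2.4 is the bound of Theorem 2.1 with c = 4 from Theorem 2.2. The second line below is an illustrative reading of the exponential-time variant mentioned in the introduction, assuming that the exact FPT algorithm of Theorem 2.3 is used as a 1-approximation for MAF and that the DFVS instances are solved exactly (d = 1).

```latex
\begin{align*}
  c = 4:          \quad & d(c + 3) = d(4 + 3) = 7d, \\
  c = 1,\ d = 1:  \quad & d(c + 3) = 1 \cdot (1 + 3) = 4.
\end{align*}
```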

Moreover, using the $O(\log(r) \log(\log(r)))$-approximation for weighted DFVS from [13], with r the
weight of a minimum feedback vertex set, we also obtain the following. 

Corollary 2.5. There exists a polynomial-time $O(\log(k) \log(\log(k)))$-approximation for
nonbinary MAAF, with k the number of components of a maximum acyclic agreement forest minus
one.



3. Approximating nonbinary MAAF. In this section we prove Theorem 2.1. We will
show that, for two nonbinary trees, MAAF can be approximated by combining algorithms for
MAF and DFVS.

An agreement forest $\mathcal{A}$ is said to be maximal if there is no agreement forest that can be obtained
from $\mathcal{A}$ by merging components. It is clear that, given any agreement forest, a maximal agreement
forest with at most as many components can be obtained in polynomial time.

Let $T_1$ and $T_2$ be two nonbinary trees. Consider a maximal agreement forest $\mathcal{A}$ and a maximum
acyclic agreement forest $\mathcal{M}$ for these two trees. We will first prove that there exists an $\mathcal{A}$-splitting
of size at most $|\mathcal{A}| + 3|\mathcal{M}|$. After that, we will show how the problem of finding an optimal
$\mathcal{A}$-splitting can be reduced to DFVS.

The idea of the first part of the proof is to split components of $\mathcal{A}$ according to $\mathcal{M}$. We show that
to make $\mathcal{A}$ acyclic we will increase the number of components of $\mathcal{A}$ by at most three times the size
of $\mathcal{M}$.

We start with some definitions. In the first part of the proof, we see an agreement forest $\mathcal{A}$ for $T_1$
and $T_2$ as a partition of the leaf set X for which the following holds:

1. $T_1|A$ and $T_2|A$ have a common refinement, for all $A \in \mathcal{A}$; and

2. the subtrees $\{T_i(A) \mid A \in \mathcal{A}\}$ are edge-disjoint, for $i \in \{1, 2\}$.

Using this definition of agreement forests, an $\mathcal{A}$-splitting is an acyclic agreement forest that can
be obtained by splitting components of $\mathcal{A}$. The following observation is easily verifiable.

Observation 1. If $\mathcal{M}$ is an acyclic agreement forest and $\mathcal{M}'$ is an agreement forest that can be
obtained from $\mathcal{M}$ by splitting components, then $\mathcal{M}'$ is an acyclic agreement forest.

For a component $A$ of an agreement forest for two trees $T_1$ and $T_2$, we write $r_i(A)$ to denote the
root of $T_i(A)$, for $i \in \{1,2\}$. For two components $M$ and $M'$ of $\mathcal{M}$, we say that $M$ is lower
than $M'$ if in the inheritance graph of $\mathcal{M}$ there is a directed path (and hence an edge) from $M'$
to $M$. Since the inheritance graph of $\mathcal{M}$ is acyclic, $\mathcal{M}$ contains some lowest element. Moreover,
for any subset of components of $\mathcal{M}$, there exists a component that is lowest over that subset. For
a component $A$ of $\mathcal{A}$ and a component $M$ of $\mathcal{M}$, we say that $M$ properly intersects $A$ (and that $A$
is properly intersected by $M$) if $M \cap A \neq \emptyset$ and $A \setminus M \neq \emptyset$.

We are now ready to describe the procedure for splitting $\mathcal{A}$. Initially, all components of $\mathcal{A}$ and $\mathcal{M}$
are unmarked. We iteratively choose a component $M^*$ of $\mathcal{M}$ that is lowest over all unmarked
components of $\mathcal{M}$. For each component $A$ of $\mathcal{A}$ that is properly intersected by $M^*$, we split $A$ into
two components $A \cap M^*$ and $A \setminus M^*$ and we mark $A \cap M^*$. Then we mark $M^*$ and any unmarked
components $A'$ of $\mathcal{A}$ with $A' \subseteq M^*$ and proceed to the next iteration. We continue this procedure
until all components of $\mathcal{M}$ are marked.
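
The procedure can be sketched on the partition view of agreement forests. The sketch assumes that both forests are given as lists of frozensets of leaves and that the components of $\mathcal{M}$ are already listed in an order in which each chosen component is lowest among the unmarked ones; computing that order from the inheritance graph is not shown, and the function name is hypothetical.

```python
def split_forest(A, M_lowest_first):
    """Split the blocks of A along the blocks of M, processed lowest-first.

    Returns the final set of blocks and the number of splits performed.
    Marked blocks lie inside an already processed block of M, so they are
    disjoint from every later M_star and never need to be revisited.
    """
    unmarked = set(A)
    marked = set()
    splits = 0
    for M_star in M_lowest_first:
        for block in list(unmarked):
            inside, outside = block & M_star, block - M_star
            if inside and outside:            # M_star properly intersects block
                unmarked.remove(block)
                unmarked.add(outside)         # stays unmarked
                marked.add(inside)            # the part inside M_star is marked
                splits += 1
            elif inside:                      # block is contained in M_star
                unmarked.remove(block)
                marked.add(block)
    return marked | unmarked, splits


A = [frozenset("abcd"), frozenset("ef")]
M = [frozenset("ab"), frozenset("cd"), frozenset("ef")]   # assumed lowest-first order
blocks, splits = split_forest(A, M)
print(sorted("".join(sorted(b)) for b in blocks), splits)   # ['ab', 'cd', 'ef'] 1
```

The sketch only manipulates leaf sets; it does not verify that the inputs are agreement forests or that the given order is really lowest-first.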

It is clear that, if $\mathcal{A}^*$ is the agreement forest obtained at the end of some iteration, then no marked
component of $\mathcal{M}$ properly intersects any component of $\mathcal{A}^*$. Moreover, no marked component of $\mathcal{A}^*$
is properly intersected by any component of $\mathcal{M}$.

Lemma 3.1. Let $\mathcal{A}$ be an agreement forest for $T_1$ and $T_2$, let $M^*$ be a component of $\mathcal{M}$ that is lowest
over all unmarked components of $\mathcal{M}$ and let $A$ be a component of $\mathcal{A}$ that is properly intersected
by $M^*$. Then, $T_i(A \cap M^*)$ and $T_i(A \setminus M^*)$ are edge-disjoint subtrees of $T_i$, for $i \in \{1,2\}$.

Proof. Suppose to the contrary that in at least one tree, say $T_1$, there exists an edge $e$ such that
$e \in T_1(A \cap M^*)$ and $e \in T_1(A \setminus M^*)$. Consider the set of leaves $S$ that can be reached from $e$.
Clearly, some leaf of $A \setminus M^*$ has to be in $S$. Let $x \in S \cap (A \setminus M^*)$. Clearly, $x$ must be in some
component of $\mathcal{M}$ other than $M^*$. Call that component $M'$. Since components of $\mathcal{M}$ are
edge-disjoint, $r_1(M')$ has to be below $e$. Because $e \in T_1(A \cap M^*)$, it follows that $e \in T_1(M^*)$ and hence
that $M'$ is lower than $M^*$. Since $M^*$ is lowest over all unmarked components of $\mathcal{M}$, it follows
that $M'$ is marked and hence that $M'$ does not properly intersect any component of $\mathcal{A}$. This is a
contradiction because $M'$ properly intersects $A$. □



Lemma 3.1 shows that when we split a component, the resulting two subtrees are edge-disjoint. 
This property holds for all components properly intersected by $M^*$. Observe that when we split a
component, the two newly-created subtrees cannot possibly share edges with other components, 
because of the assumption that at the start of the iteration all the subtrees were edge-disjoint. 
Hence, at the end of the iteration, all of the subtrees are mutually edge-disjoint. To show that we 
still have an agreement forest, it is necessary to show that the components still obey the refinement 
criterion. This follows from the following (unsurprising) observation. 

Observation 2. Let $T_1^*$ and $T_2^*$ be two trees on taxon set $X^*$ such that $T_1^*$ and $T_2^*$ have a common
refinement. Then, for any $X' \subseteq X^*$, $T_1^*|X'$ and $T_2^*|X'$ have a common refinement.

We have now shown that the result of each iteration is an agreement forest for $T_1$ and $T_2$ such that
no marked component of $\mathcal{M}$ properly intersects any component of this agreement forest. Let $\mathcal{A}'$
be the agreement forest obtained at the end of the last iteration. Since at the end of the procedure
all components of $\mathcal{M}$ are marked, no component of $\mathcal{M}$ properly intersects any component of $\mathcal{A}'$.
It follows that $\mathcal{A}'$ can be obtained from $\mathcal{M}$ by splitting components. Since $\mathcal{M}$ is acyclic, it follows
by Observation 1 that $\mathcal{A}'$ is acyclic. Hence, $\mathcal{A}'$ is an $\mathcal{A}$-splitting. It remains to bound the size of
this $\mathcal{A}$-splitting. To do so, we will need the following observation and lemma.

Observation 3. Let $A_1$ and $A_2$ be components of some agreement forest for $T_1$ and $T_2$. If $A_1$
and $A_2$ have the same root in both $T_1$ and $T_2$, then the result of merging $A_1$ and $A_2$ into a single
component $A_1 \cup A_2$ is still an agreement forest of $T_1$ and $T_2$.

Lemma 3.2. If $\mathcal{A}$ is the agreement forest at the beginning of some iteration, then there are no
four unmarked components of $\mathcal{A}$ that have a common vertex in both trees.

Proof. First let $\mathcal{A}$ be the agreement forest at the beginning of the first iteration. Suppose that
$A_1, A_2, A_3, A_4$ are unmarked components of $\mathcal{A}$ and that $v_i$ is common to $T_i(A_1), \ldots, T_i(A_4)$ for
$i \in \{1, 2\}$. For each $i \in \{1, 2\}$, there is at most one $j \in \{1, 2, 3, 4\}$ for which $T_i(A_j)$ contains the edge
entering $v_i$. Hence, there are at least two components, say $A_1$ and $A_2$, that do not use this edge
in either tree. It follows that $r_i(A_1) = r_i(A_2) = v_i$ for $i \in \{1,2\}$. However, then $A_1$ and $A_2$
can be merged into a single component by Observation 3, which is a contradiction because $\mathcal{A}$ is
maximal.

We have shown that the lemma is true at the beginning of the first iteration. Now assume that 
it is true at the beginning of some iteration. Each component that is split during the iteration 
is split into one marked and one unmarked component. Hence, for each vertex, the number of 
unmarked components using that vertex does not increase. It follows that the lemma is still true 
at the end of the iteration. This completes the proof. □ 

Lemma 3.3. Let $\mathcal{A}$ be the agreement forest at the beginning of some iteration. If $M^*$ is a
component of $\mathcal{M}$ that is lowest over all unmarked components of $\mathcal{M}$, then $M^*$ properly intersects
at most three components of $\mathcal{A}$.

Proof. Let $A$ be a component of $\mathcal{A}$ that is properly intersected by $M^*$. We claim that there is a
directed path from $r_i(A)$ to $r_i(M^*)$ for $i \in \{1, 2\}$ (possibly, $r_i(A) = r_i(M^*)$). To see this, first note
that, since $A \cap M^* \neq \emptyset$, there must be either a directed path from $r_i(A)$ to $r_i(M^*)$, or from $r_i(M^*)$
to $r_i(A)$, for $i \in \{1,2\}$. Suppose that this path goes from $r_i(M^*)$ to $r_i(A)$ and contains at least one
edge. Since $A \cap M^* \neq \emptyset$, this path has to contain at least one edge of $T_i(M^*)$. Now observe that,
since $M^*$ properly intersects $A$, there exists some $a \in A \setminus M^*$, which is in some component of $\mathcal{M}$,
say in $M'$. Note that $M'$ is unmarked since it properly intersects $A$. However, then we obtain
a contradiction because $M'$ is lower than $M^*$, while $M^*$ is lowest over all unmarked components
of $\mathcal{M}$. Hence, there is a directed path from $r_i(A)$ to $r_i(M^*)$ for $i \in \{1, 2\}$. This path is contained
in $T_i(A)$ because $A \cap M^* \neq \emptyset$.

Now assume that the lemma is not true, i.e. that there exist four components $A_1, A_2, A_3, A_4$ of $\mathcal{A}$
that are properly intersected by $M^*$. We have seen that there is a directed path from $r_i(A_j)$
to $r_i(M^*)$, which is contained in $T_i(A_j)$, for $i \in \{1,2\}$ and $j \in \{1,2,3,4\}$. Hence, $r_i(M^*)$ is
common to all four components $A_1, A_2, A_3, A_4$. Moreover, $A_1, A_2, A_3, A_4$ are all unmarked since
they are properly intersected by $M^*$. This is a contradiction to Lemma 3.2. □

Lemma 3.3 shows that each iteration splits at most three components. Each split adds one
component. Thus, for every component $M$ of $\mathcal{M}$, the number of components of the $\mathcal{A}$-splitting is
increased by at most three, which concludes the proof of the following theorem.

Theorem 3.4. Let $T_1$ and $T_2$ be two (nonbinary) trees. If $\mathcal{A}$ is a maximal agreement forest
and $\mathcal{M}$ a maximum acyclic agreement forest of $T_1$ and $T_2$, then there exists an $\mathcal{A}$-splitting of size
at most $|\mathcal{A}| + 3|\mathcal{M}|$.

Now, suppose we have a c-approximation $\mathcal{A}$ for MAF; i.e.,

$$|\mathcal{A}| - 1 \leq c \cdot \mathrm{MAF}(T_1,T_2) \leq c \cdot \mathrm{MAAF}(T_1,T_2).$$

Let $\mathrm{OptSplit}(\mathcal{A})$ denote the size of an $\mathcal{A}$-splitting with the smallest number of components. The last
inequality together with Theorem 3.4 implies that

$$\mathrm{OptSplit}(\mathcal{A}) - 1 \leq |\mathcal{A}| - 1 + 3 \cdot \mathrm{MAAF}(T_1,T_2) \leq (c + 3) \cdot \mathrm{MAAF}(T_1,T_2).$$

The remaining part of the proof will be accomplished by reducing the problem of finding an
optimal $\mathcal{A}$-splitting to DFVS in such a way that a d-approximation algorithm for DFVS gives a
d-approximation for an optimal $\mathcal{A}$-splitting. Combining this with the above inequality, the proof
of Theorem 2.1 will follow.

From now on, we see an agreement forest as a collection of trees, as specified in the preliminaries
section. We will assume that the components of $\mathcal{A}$ are never more refined than necessary (i.e. there
is no agreement forest of $T_1$ and $T_2$ that can be obtained from $\mathcal{A}$ by contracting edges).

To prepare for the construction of an input graph to DFVS, we label the vertices and edges of $T_1$
and $T_2$ by the vertices and edges of $\mathcal{A}$ that they correspond to. Note that each vertex of $\mathcal{A}$ is used
exactly once as a label but, due to refinements, some vertices of the trees can have multiple labels.
Moreover, some edges of $\mathcal{A}$ can be used as a label multiple times (when the edge corresponds to a
path in a tree) but each edge is used as a label at least once in at least one tree (by the assumption
that $\mathcal{A}$ is not more refined than necessary). For example, for the trees and agreement forest in
Figure 3.1, the labelling is illustrated in Figure 3.2.

Fig. 3.1. Two trees $T_1$ and $T_2$, an agreement forest $\mathcal{A}$ for $T_1$ and $T_2$ and its inheritance graph $IG(\mathcal{A})$. Leaf
labels have been omitted.

Fig. 3.2. The trees $T_1$ and $T_2$ from Figure 3.1 labelled with the internal vertices and all edges of $\mathcal{A}$.

We construct an input graph $D$ for DFVS as follows. For every internal vertex of $\mathcal{A}$, we create a
vertex for $D$. Denote the set of these vertices by $V_V(D)$. In addition, for every edge of $\mathcal{A}$ we create
a vertex for $D$. This set of vertices is denoted by $V_E(D)$. We write $V(D) = V_V(D) \cup V_E(D)$. We
create edges of $D$ as follows. For every $v \in V_V(D)$ and every $e \in V_E(D)$ we create an edge $(v, e)$ if
$\mathrm{tail}(e) = v$ in $\mathcal{A}$, and we create an edge $(e, v)$ if, in at least one tree, $v'$ can be reached from $\mathrm{head}(e')$
for some tree edge $e'$ labelled by $e$ and some tree vertex $v'$ labelled by $v$. See Figure 3.3 for an example.
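
The construction of D can be sketched as follows, under the assumption that the labelling step has already been carried out and summarised in a set of pairs `reachable`. That input, the function name and the toy data are hypothetical and serve only to make the two arc rules explicit.

```python
def build_dfvs_input(internal_vertices, forest_edges, reachable):
    """Build D as an adjacency dict with keys ("v", x) for V_V(D) and ("e", x) for V_E(D).

    internal_vertices: ids of the internal vertices of the agreement forest
    forest_edges:      dict mapping an edge id to its (tail, head) pair in the forest
    reachable:         set of (edge id, vertex id) pairs (e, v) such that, in at least
                       one tree, a vertex labelled v can be reached from the head of
                       an edge labelled e (assumed to be precomputed)
    """
    D = {("v", v): set() for v in internal_vertices}
    D.update({("e", e): set() for e in forest_edges})
    for e, (tail, _head) in forest_edges.items():
        if ("v", tail) in D:
            D[("v", tail)].add(("e", e))       # arc (v, e) whenever tail(e) = v
    for e, v in reachable:
        D[("e", e)].add(("v", v))              # arc (e, v) from the reachability rule
    return D


# Toy data: one internal vertex u, one edge a leaving u, and u reachable from head(a).
D = build_dfvs_input(["u"], {"a": ("u", "x")}, {("a", "u")})
print(D)
```

In this toy instance D contains a directed 2-cycle through the vertices for u and a, so deleting either one of them makes D acyclic.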

We will show that any feedback vertex set of $D$ corresponds to an $A$-splitting and vice versa.
Moreover, after appropriately weighting the vertices of $D$, the number of "splits" of the $A$-splitting
(i.e. the size of the $A$-splitting minus the size of $A$) will be equal to the weight of the corresponding
weighted feedback vertex set.

Fig. 3.3. Part of the input graph $D$ for DFVS for the trees $T_1$, $T_2$ and agreement forest $A$ of Figures 3.1
and 3.2. Vertices that do not appear in any directed cycle of $D$ have been omitted. Minimum feedback vertex sets
of $D$ are $\{u_1\}$ and $\{a\}$.

Let $F \subseteq V(D)$. In what follows, we use $F$ both for sets of vertices in $D$ and for the set of
vertices and edges they represent in $A$. Let $A \setminus F$ be the forest obtained from $A$ by removing the
vertices and edges of $F$ and repeatedly removing indegree-0 outdegree-1 vertices and suppressing
indegree-1 outdegree-1 vertices.
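
The clean-up that turns $A$ into $A \setminus F$ can be sketched as follows. This is a minimal Python sketch, assuming the forest is given as a vertex set plus a set of `(parent, child)` pairs; the function name and this representation are choices made here, not part of the paper. It applies exactly the two rules stated above and nothing else.

```python
def remove_and_clean(vertices, edges, removed_vertices, removed_edges):
    """Remove the given vertices and edges from a forest, then clean up."""
    verts = set(vertices) - set(removed_vertices)
    arcs = {(p, c) for (p, c) in edges
            if (p, c) not in removed_edges and p in verts and c in verts}
    changed = True
    while changed:
        changed = False
        for v in list(verts):
            ins = [a for a in arcs if a[1] == v]
            outs = [a for a in arcs if a[0] == v]
            if len(ins) == 0 and len(outs) == 1:
                # indegree-0 outdegree-1 vertex: delete it with its only edge
                verts.discard(v)
                arcs.discard(outs[0])
                changed = True
            elif len(ins) == 1 and len(outs) == 1:
                # indegree-1 outdegree-1 vertex: suppress it
                verts.discard(v)
                arcs.discard(ins[0])
                arcs.discard(outs[0])
                arcs.add((ins[0][0], outs[0][1]))
                changed = True
    return verts, arcs
```

The quadratic scan over the edge set keeps the sketch short; adjacency lists would be the natural choice for larger inputs.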

Lemma 3.5. A subset $F$ of $V(D)$ is a feedback vertex set of $D$ if and only if $A \setminus F$ is an
$A$-splitting.

Proof. It is clear that $A \setminus F$ is an agreement forest of $T_1$ and $T_2$. Hence, it is enough to show that
$D \setminus F$ has a directed cycle if and only if $IG(A \setminus F)$ has a directed cycle.

First, let $C_1, C_2, \ldots, C_k = C_1$ be a cycle in $IG(A \setminus F)$. We will show that this implies that
there exists a cycle in $D \setminus F$. Let $u_1, u_2, \ldots, u_k$ be the roots of the components $C_1, C_2, \ldots, C_k$,
respectively. Notice that $u_1, u_2, \ldots, u_k$ are internal vertices of $A$ and hence represented in $D$.
Moreover, since these vertices are in $A \setminus F$, they are also in $D \setminus F$.

We prove this side of the lemma by showing that the presence of an edge $(C_i, C_{i+1})$ of $IG(A \setminus F)$
implies the existence of a directed path from $u_i$ to $u_{i+1}$ in $D \setminus F$, for $i = 1, \ldots, k-1$. By the
definition of the inheritance graph, an edge $(C_i, C_{i+1})$ of $IG(A \setminus F)$ implies that there exists a
directed path from $u_i$ to $u_{i+1}$ that uses an edge of $C_i$ in at least one of the trees. It is easy to
see that this directed path uses an edge, $a$ say, of $C_i$ that has $u_i$ as its tail. Moreover, since $C_i$
and $C_{i+1}$ are components of $A \setminus F$, $u_i$, $a$ and $u_{i+1}$ are vertices of $D \setminus F$ and $(u_i, a)$ and $(a, u_{i+1})$
are edges of $D \setminus F$, forming a directed path from $u_i$ to $u_{i+1}$ in $D \setminus F$.

Now, assume that $D \setminus F$ contains a cycle. Let $u_1, \ldots, u_k = u_1$ be a longest cycle. We will show
that this implies the existence of a cycle in $IG(A \setminus F)$. Clearly, $u_1, \ldots, u_k$ cannot all correspond to
vertices and edges from the same component of $A \setminus F$. From the assumption that $u_1, \ldots, u_k = u_1$
is a longest cycle, it can be argued that there are at least two vertices among $u_1, \ldots, u_k$ that are
roots of components of $A \setminus F$. Let $r_1, \ldots, r_s = r_1$ denote those vertices of the cycle that correspond
to roots of components. Let $C_1, \ldots, C_s$ be the corresponding components of $A \setminus F$, respectively.
We will show that $C_1, \ldots, C_s$ form a cycle in $IG(A \setminus F)$.

For $i = 1, \ldots, s-1$, we show that there exists an edge $(C_i, C_{i+1})$ in $IG(A \setminus F)$. First observe that
there exists a directed path from $r_i$ to $r_{i+1}$ in $D \setminus F$ and that, since $D$ is bipartite, this path is of
the form

$$(r_i = v_0, e_1, v_1, e_2, \ldots, e_t, v_t = r_{i+1})$$

where $e_1, \ldots, e_t$ are edges of $A \setminus F$ and $v_1, \ldots, v_{t-1}$ are internal vertices of $C_i$.

First assume $t = 1$. In this case, the path is just $(r_i, e_1, r_{i+1})$. By the definition of $D$, $\mathrm{tail}(e_1) = r_i$
in $C_i$ and there is a directed path from $\mathrm{head}(\hat{e}_1)$ to $r_{i+1}$, with $\hat{e}_1$ some edge labelled by $e_1$, in at
least one of the two trees (such an edge $\hat{e}_1$ definitely exists because of the assumption that $A$ is
not more refined than necessary). This tree then contains a path from $r_i$ to $r_{i+1}$ that contains
edge $e_1$ of $C_i$. Hence there is an edge $(C_i, C_{i+1})$ in $IG(A \setminus F)$.

Now assume $t > 1$. For $j = 1, \ldots, t$, by the definition of $D$, $\mathrm{tail}(e_j) = v_{j-1}$ in $C_i$ and there is
a directed path from $\mathrm{head}(\hat{e}_j)$ to $v_j$, with $\hat{e}_j$ some edge labelled by $e_j$, in at least one of the two
trees. For $j < t$, $e_j$ and $v_j$ are both in the same component $C_i$, implying that there is a directed
path from $\mathrm{head}(e_j)$ to $v_j$ in $C_i$. Hence, there is a directed path from $r_i$ to $v_{t-1}$ in both trees.
Thus, the tree that contains a directed path from $v_{t-1}$ to $r_{i+1}$ also contains a directed path from $r_i$
to $r_{i+1}$ containing edge $e_t$ of $C_i$. Hence, $(C_i, C_{i+1})$ is an edge of $IG(A \setminus F)$. □

The above lemma showed that we can turn $A$ into an $A$-splitting by removing vertices and edges
corresponding to a feedback vertex set of $D$. We will now add weights to the vertices of $D$ in order
to enforce that an optimal feedback vertex set gives an optimal $A$-splitting and, moreover, that
an approximate feedback vertex set gives an approximate $A$-splitting.

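We define a weight function $w$ on vertices of $D$ as follows, using $\delta^+_A(v)$ to denote the outdegree of
a vertex $v$ in $A$.

$$w(v) = \begin{cases} \delta^+_A(v) - 1 & \text{if } v \in V_V(D), \\ 1 & \text{if } v \in V_E(D). \end{cases}$$

The weight of a feedback vertex set $F$ is defined as $w(F) = \sum_{v \in F} w(v)$.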

Intuitively, the weight of each vertex $v$ equals the number of components its removal would add
to the $A$-splitting.
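
A small Python sketch of the weighting, under the reading that a vertex of $V_V(D)$ weighs its outdegree in $A$ minus one and a vertex of $V_E(D)$ weighs 1 (the reading that matches the intuition just stated: deleting a root with $\delta$ children turns one component into $\delta$). The tagging of vertices as `("v", name)` or `("e", name)` and both function names are assumptions of this sketch.

```python
def vertex_weight(node, out_degree_in_A):
    """Weight of one vertex of D; node is ("v", name) or ("e", name)."""
    kind, name = node
    if kind == "v":
        # internal vertex of A: removing it adds outdegree - 1 components
        return out_degree_in_A[name] - 1
    # edge of A: removing it adds exactly one component
    return 1


def fvs_weight(F, out_degree_in_A):
    """Total weight w(F) of a set F of vertices of D."""
    return sum(vertex_weight(x, out_degree_in_A) for x in F)
```

For instance, a vertex with three children in $A$ has weight 2, the same as two of its outgoing edges taken together.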

To make this precise, we need the following definition. We call a feedback vertex set $F$ proper if it
is minimal (i.e. no proper subset of $F$ is a feedback vertex set) and if for every vertex $v \in V_V(D)$
at most $\delta^+_D(v) - 2$ children of $v$ in $D$ are contained in $F$, with $\delta^+_D(v)$ denoting the outdegree of $v$
in $D$. (Recall that the children of $v$ in $D$ are elements of $V_E(D)$.) For example, proper feedback
vertex sets of the graph $D$ in Figure 3.3 are $\{u_1\}$ and $\{v_1, w_1\}$. The idea behind this definition is
that when all, or all but one, of the children of $v$ are in $F$, then we could just as well add $v$ instead,
which does not increase the total weight of the feedback vertex set. Hence, given any feedback
vertex set $F$, a proper feedback vertex set $F'$ with $w(F') \le w(F)$ can be found in polynomial
time.
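
One possible way to carry out this polynomial-time conversion is sketched below in Python. It is a sketch under assumptions, not the procedure of the paper: vertices of $D$ are tagged `("v", name)` or `("e", name)`, every internal vertex of $A$ has outdegree at least 2, and the only arc entering an edge-vertex comes from its tail. First the children of $v$ are exchanged for $v$ wherever all, or all but one, of them are selected; then redundant vertices are dropped greedily, which cannot violate the child bound again.

```python
def has_cycle(nodes, arcs, removed):
    """True if D minus `removed` contains a directed cycle (Kahn's algorithm)."""
    alive = set(nodes) - set(removed)
    indeg = {x: 0 for x in alive}
    succ = {x: [] for x in alive}
    for s, t in arcs:
        if s in alive and t in alive:
            succ[s].append(t)
            indeg[t] += 1
    stack = [x for x in alive if indeg[x] == 0]
    seen = 0
    while stack:
        x = stack.pop()
        seen += 1
        for y in succ[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                stack.append(y)
    return seen < len(alive)


def make_proper(nodes, arcs, F):
    """Turn a feedback vertex set F of D into a proper one of no larger weight."""
    F = set(F)
    children = {}
    for s, t in arcs:
        if s[0] == "v":
            children.setdefault(s, set()).add(t)
    # Step 1: if all or all but one of the children of v are in F, take v instead.
    for v, kids in children.items():
        chosen = kids & F
        if chosen and len(chosen) >= len(kids) - 1:
            F -= kids
            F.add(v)
    # Step 2: one greedy pass makes F minimal, since supersets of a
    # feedback vertex set are feedback vertex sets.
    for x in sorted(F):
        if not has_cycle(nodes, arcs, F - {x}):
            F.discard(x)
    return F
```

Step 1 is weight-neutral or weight-reducing because $v$ has the same outdegree in $D$ as in $A$, so at least $\delta^+_A(v) - 1$ units of weight are exchanged for $w(v) = \delta^+_A(v) - 1$.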

Lemma 3.6. If $F$ is a proper feedback vertex set of $D$, then $A \setminus F$ is an $A$-splitting of size
$|A| + w(F)$.

Proof. For an edge e of A, we use tail(e) and head(e) to refer to the tail and head of e in A, 
respectively. 

Consider a vertex of $F$ corresponding to an edge $e$ of $A$. Then, every cycle in $D$ that contains $e$
also contains $\mathrm{tail}(e)$. Hence, since $F$ is minimal, $\mathrm{tail}(e)$ is not contained in $F$. Moreover, suppose
that $\mathrm{tail}(e)$ is not the root of a component of $A$ and let $e'$ be the edge with $\mathrm{head}(e') = \mathrm{tail}(e)$.
Then, for every cycle in $D$ containing $e$ but not $e'$ (and hence containing $\mathrm{tail}(e)$ but not $\mathrm{tail}(e')$),
replacing $e$ and $\mathrm{tail}(e)$ by $e'$ and $\mathrm{tail}(e')$ gives again a directed cycle in $D$. Hence, since $F$ is
minimal, it follows that either $e'$ or $\mathrm{tail}(e')$ is contained in $F$, but not both.

Now consider a vertex of $F$ corresponding to a vertex $v$ of a component $C$ of $A$. Then, by the
previous paragraph, no edge $e$ with $\mathrm{tail}(e) = v$ is contained in $F$. Moreover, if $v$ is not the root
of $C$ and $e'$ is the edge with $\mathrm{head}(e') = v$, then, as before, either $e'$ or $\mathrm{tail}(e')$ is contained in $F$,
but not both.

To summarize the previous two paragraphs, if $F$ contains a vertex corresponding to a vertex $v$
of $A$ that is not the root of its component, then $F$ also contains either the edge entering $v$ or the
parent of $v$. Similarly, if $F$ contains a vertex corresponding to an edge $e$ of $A$ that is not leaving
the root of its component, then $F$ also contains either the edge entering $\mathrm{tail}(e)$ or the parent
of $\mathrm{tail}(e)$. Informally speaking, this means that if you remove something from a component, you
also remove everything above it.

Fig. 4.1. A tree $T$ and two forests $F_1$ and $F_2$ for $T$. Forest $F_2$ can be obtained from $F_1$ by refining the parent
of $a$, $b$ and $c$ and subsequently removing an edge and suppressing an indegree-1 outdegree-1 vertex.

More formally, it follows that A \ F can be obtained from A by, repeatedly, either removing the 
root of a component or removing a subset of the edges leaving the root of a component. Moreover, 
if we remove a subset of the edges leaving the root of a component, at least two of these edges are 
not removed, because F is proper. 

Removing edges and vertices from A in this way, it is easy to see that the number of components 
is increased by w(F). □ 

To complete the proof, let $A$ be a $c$-approximation to MAF, $F$ a minimal feedback vertex set that
is a $d$-approximation to weighted DFVS on $D$ and $F^*$ an optimal solution to weighted DFVS
on $D$. Then we have:

$$\begin{aligned}
|A \setminus F| - 1 &= |A| + w(F) - 1 \\
&\le |A| + d \cdot w(F^*) - 1 \\
&\le d(|A| + w(F^*) - 1) \\
&= d(\mathrm{OptSplit}(A) - 1) \\
&\le d(c + 3)\,\mathrm{MAAF}(T_1, T_2).
\end{aligned}$$
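
The second inequality of the chain is the only step that uses $d \ge 1$ together with $|A| \ge 1$. Spelled out as a short check, the difference between its two sides is:

```latex
\begin{aligned}
d\bigl(|A| + w(F^*) - 1\bigr) - \bigl(|A| + d \cdot w(F^*) - 1\bigr)
  &= (d - 1)\,|A| - (d - 1) \\
  &= (d - 1)\bigl(|A| - 1\bigr) \;\ge\; 0 .
\end{aligned}
```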

We have thus shown how to construct an agreement forest that is a $d(c + 3)$-approximation to
MAAF, and with that we conclude the proof of Theorem 2.1.



4. Nonbinary MAF. In the first part of this section we present a 4-approximation algorithm
for MAF, proving Theorem 2.2. The performance analysis leads almost straightforwardly to a
fixed-parameter tractability result, which we present in the second part.

4.1. Approximation of nonbinary MAF. Let $T_1$ and $T_2$ be the input trees to MAF.
They do not have to be binary. We will construct a forest by "cutting" $T_2$. A "cut" operation
of $T_2$ consists of removing an edge (and suppressing indegree-1 outdegree-1 vertices) or of first
refining a vertex (with outdegree greater than 2) and then removing an edge (and suppressing
indegree-1 outdegree-1 vertices), see Figure 4.1.



Let $F_2$ be the forest obtained by cutting $T_2$. In each iteration, we further cut $F_2$ until at some
point $F_2$ becomes a forest also of $T_1$. At that point, we have successfully obtained an agreement
forest for $T_1$ and $T_2$ and we terminate the algorithm.

We describe an algorithm to determine which edges to cut by defining an iteration of the algorithm.
Suppose that at the start of the iteration we have an (intermediate) forest $F_2$. There are two main
cases. In each case, the algorithm will make at most 4 cuts, and we will show that at least one of
these cuts is unavoidable, thus showing that in each case a 4-approximation is attained.

Take an arbitrary internal vertex $u$ of $T_1$ with the property that all its children are leaves. Such a
vertex clearly exists in any tree. Let $C$ be the set of children of $u$ in $T_1$ and $\bar{C}$ the set of all leaves
that are not in $C$.
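
Selecting such a vertex is straightforward; a minimal Python sketch follows, assuming $T_1$ is stored as a dictionary `children` from each internal vertex to the list of its children (this representation and the function name are assumptions of the sketch). It returns the vertex together with the two leaf sets, here called `C` and `C_bar`.

```python
def pick_u(children):
    """Return (u, C, C_bar) for some internal vertex u whose children are all leaves."""
    # a leaf is any vertex that appears as a child and has no children itself
    leaves = {k for kids in children.values() for k in kids if not children.get(k)}
    for u, kids in children.items():
        if kids and all(k in leaves for k in kids):
            C = set(kids)
            return u, C, leaves - C
    raise ValueError("tree has no internal vertex")
```

Any internal vertex of maximum depth qualifies, which is why the search always succeeds on a tree with at least one internal vertex.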

First the algorithm checks the following three simple cases in the given order.
Case 0a. There exist $c_1, c_2 \in C$ that have a common parent in $F_2$.

In this case, we collapse the subtree on $c_1$ and $c_2$ to a single leaf in both $T_1$ and $F_2$. To be precise,
we do the following in both $T_1$ and $F_2$. If $c_1$ and $c_2$ have no siblings, we delete $c_1$ and $c_2$ and label
their former parent by $\{c_1, c_2\}$. If $c_1$ and $c_2$ do have siblings, we delete $c_1$ and replace label $c_2$ by
label $\{c_1, c_2\}$.

Case 0b. Some leaf $c \in C$ is an isolated vertex in $F_2$.

In this case, we remove $c$ from both $T_1$ and $F_2$ and suppress any resulting outdegree-1 vertices.
At the end, after recursively having computed an agreement forest, we add an isolated vertex
for $c$.

Case 0c. The leaves in $C$ are all in different components of $F_2$.

In this case, we remove all $c \in C$ from both $T_1$ and $F_2$ and suppress any resulting outdegree-1
vertices. At the end, after recursively having computed an agreement forest, we add isolated
vertices for all $c \in C$.
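
The order in which the three simple cases are tested can be sketched as follows. This Python fragment only classifies the situation; it assumes hypothetical inputs `parent_in_F2` (parent of each leaf in $F_2$, absent or `None` for an isolated leaf) and `component_of` (an identifier of the component of $F_2$ containing each leaf), neither of which is notation from the paper.

```python
def simple_case(C, parent_in_F2, component_of):
    """Return ("0a", pair), ("0b", (c,)), ("0c", leaves) or None, tested in that order."""
    C = sorted(C)
    # Case 0a: two leaves of C with a common parent in F_2
    for i, a in enumerate(C):
        for b in C[i + 1:]:
            pa, pb = parent_in_F2.get(a), parent_in_F2.get(b)
            if pa is not None and pa == pb:
                return "0a", (a, b)
    # Case 0b: some leaf of C is an isolated vertex of F_2
    for c in C:
        if parent_in_F2.get(c) is None:
            return "0b", (c,)
    # Case 0c: all leaves of C lie in pairwise different components of F_2
    comps = [component_of[c] for c in C]
    if len(set(comps)) == len(comps):
        return "0c", tuple(C)
    return None
```

A result of `None` means that some component of $F_2$ contains two leaves of $C$ without a common parent, which is the situation handled by the two main cases.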

Correctness of the procedure followed in the first two cases is obvious. To prove correctness in
Case 0c, let $F$ be an agreement forest of $T_1$ and $T_2$ that can be obtained by cutting $F_2$. Since
the leaves in $C$ are all in different components of $F_2$, they are all in different components of $F$.
Observe that any component of $F$ that contains an element of $C$ and an element of $\bar{C}$ has to use
the edge entering $u$ in $T_1$. It follows that at most one component of $F$ contains an element of $C$
and an element of $\bar{C}$. Since the elements of $C$ are all in different components of $F$, it follows that
at most one element from $C$ is not a singleton in $F$. Hence, at most one of the cuts made in this
step is avoidable. At least 2 cuts were made because $|C| \ge 2$ and no $c \in C$ was already an isolated
vertex in $F_2$ by Case 0b. It follows that at least half of the cuts made were unavoidable.

If none of the above cases applies, the algorithm picks $c_1, c_2$ from $C$ in such a way that $c_1$ and $c_2$
are in the same component $A_2$ of $F_2$ and such that their lowest common ancestor in $A_2$ is at
maximum distance from the root of the component. Moreover, if there exists such a pair for which
neither of $c_1$ and $c_2$ is a child of their lowest common ancestor, we pick such a pair first.

Case 1. Neither of $c_1$ and $c_2$ is a child of their lowest common ancestor $v$ in $F_2$.

Let $p_1$ be the parent of $c_1$ and $p_2$ the parent of $c_2$ in $F_2$. (We have $p_1 \ne p_2$ because we are not
in Case 0a.) Let $S_1$ be the set of all leaves that are descendants of $p_1$ except for $c_1$. Similarly,
let $S_2$ be the set of all leaves that are descendants of $p_2$ except for $c_2$. Finally, let $R$ be the set of
other leaves of component $A_2$, i.e. leaves that are not in $S_1 \cup S_2 \cup \{c_1, c_2\}$. We cut $A_2$ by creating
separate components for $c_1$, $c_2$, $S_1$, $S_2$ and $R$, thus making four cuts, see Figure 4.2.
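
The leaf sets used in Case 1 can be computed directly from the component. The following Python sketch assumes the component $A_2$ is given as a dictionary `children` from each internal vertex to its children, together with its `root`; these names and the function name are assumptions of the sketch, and the cutting itself is not shown.

```python
def case1_partition(children, root, c1, c2):
    """Return the leaf sets (S1, S2, R) of Case 1 for the component rooted at `root`."""
    parent = {k: p for p, kids in children.items() for k in kids}

    def leaves_below(x):
        kids = children.get(x, [])
        if not kids:
            return {x}
        out = set()
        for k in kids:
            out |= leaves_below(k)
        return out

    S1 = leaves_below(parent[c1]) - {c1}   # leaves below p1 other than c1
    S2 = leaves_below(parent[c2]) - {c2}   # leaves below p2 other than c2
    R = leaves_below(root) - S1 - S2 - {c1, c2}
    return S1, S2, R
```

Separating $c_1$, $c_2$, $S_1$, $S_2$ and $R$ turns one component into at most five, which accounts for the four cuts.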

We claim that at least one of these four cuts was unavoidable. Let $F$ be an agreement forest
of $T_1$ and $T_2$ with a minimum number of components. If any of $c_1, c_2$ is a singleton in $F$ then
the cut that removed the edge entering that vertex was unavoidable. Hence, we assume that $c_1$
and $c_2$ are both non-singletons in $F$. Suppose $c_1$ and $c_2$ are in the same component of $F$. Because
of the choice of $c_1$ and $c_2$ having their lowest common ancestor furthest from the root of the
component, $S_1, S_2 \subseteq \bar{C}$, i.e. they contain no element of $C$. Hence cutting off both $S_1$ and $S_2$ is
unavoidable.

Hence, we assume that $c_1$ and $c_2$ are in different, non-singleton components of $F$. As argued
before, in justifying Case 0c, at most one of $c_1$ and $c_2$ can be in a component with elements
from $\bar{C}$. Moreover, as also argued before, $S_1, S_2 \subseteq \bar{C}$. Therefore, at most one of $c_1$ and $c_2$ can
be in a component with elements from $S_1 \cup S_2$. W.l.o.g. suppose $c_1$ is not in a component with
elements from $S_1 \cup S_2$. Since $c_1$ is not a singleton component, it is contained in a component which
uses the edge of $A_2$ entering $p_1$ and the edge from $p_1$ to $c_1$, but none of the other edges leaving $p_1$.
Hence, the cut cutting off $S_1$ is unavoidable. Similarly, if $c_2$ is not in a component with elements
from $S_1 \cup S_2$, then the cut cutting off $S_2$ is unavoidable. Thus, always at least one of the four cuts
was unavoidable.

Fig. 4.2. Case 1: none of $c_1$ and $c_2$ is a child of their lowest common ancestor. We cut $F_2$ into $F_2'$ by creating
separate components for $c_1$, $c_2$, $S_1$, $S_2$ and $R$.

Fig. 4.3. Case 2: $c_1$ is a child of the lowest common ancestor of $c_1$ and $c_2$. We cut $F_2$ into $F_2'$ by creating
separate components for $c_1$, $c_2$, $S_1$, $S_2$ and $R$.



Case 2. Either $c_1$ or $c_2$ is a child of their lowest common ancestor in $F_2$.

Let $c_1, c_2$ be any such pair. Let again $p_1$ be the parent of $c_1$ and $p_2$ the parent of $c_2$ in $F_2$. Assume
without loss of generality that $p_1$ is the lowest common ancestor of $c_1$ and $c_2$. Let $S_2$ contain all
leaves that are descendants of $p_2$ except for $c_2$. Let $S_1$ contain all leaves that are descendants of $p_1$
except for $c_1$, $c_2$ and the leaves in $S_2$. Let $R$ contain all remaining leaves. We cut $F_2$ by creating
separate components for $c_1$, $c_2$, $S_1$, $S_2$ and $R$, thus making four cuts, see Figure 4.3.



We claim that also in this case at least one of the four cuts was unavoidable. Let $F$ again be an
agreement forest of $T_1$ and $T_2$ with a minimum number of components. We can argue as before
that we can restrict attention to the situation in which $c_1$ and $c_2$ are non-singletons and belong to
different components. It is also again true that $S_1$ and $S_2$ cannot contain elements from $C$. To see
this, first note that no element $c_3$ of $C \setminus \{c_1\}$ can be a child of $p_1$ because then $c_1, c_3 \in C$ would have
a common parent in $F_2$, which is a Case 0a situation. Moreover, no $c_3 \in C \setminus \{c_1, c_2\}$ can be reached
from $p_1$ by a directed path with at least one internal vertex that does not belong to the path from
$p_1$ to $c_2$, because then $c_2, c_3$ would conform to Case 1. Finally, as before, no $c_3 \in C \setminus \{c_1, c_2\}$ can
be reached from an internal vertex of the path from $p_1$ to $c_2$ because then $c_3$ and $c_2$ would have a
lowest common ancestor further away from the root than $p_1$. We conclude that $S_1$ and $S_2$ contain
no elements from $C$.

As in Case 1, it follows that at most one of $c_1$ and $c_2$ is in a component with elements from $S_1 \cup S_2$.
Also, similar to the arguments in Case 1, if $c_1$ is not in a component with elements from $S_1 \cup S_2$
then that component uses the edge of $A_2$ entering $p_1$ and the edge from $p_1$ to $c_1$, but no other
edges leaving $p_1$. Hence, cutting off $S_1 \cup S_2 \cup \{c_2\}$ is unavoidable. If $c_2$ is not in a component
with elements from $S_1 \cup S_2$, then cutting off $S_2$ is unavoidable.

Hence, we conclude that in each case at least one of the four cuts is unavoidable, and the algorithm 
thus yields a 4-approximation. 



4.2. An FPT algorithm for nonbinary MAF. In this section, we show that there exists
an $O(4^k \mathrm{poly}(n))$ time algorithm for nonbinary MAF, i.e. we prove Theorem 2.3.



The algorithm follows the same ideas as the approximation algorithm in the proof of Theorem 2.2.
Cases 0a and 0b are executed in exactly the same way. In Case 0c, instead of removing all $c \in C$,
we pick $c_1, c_2 \in C$ arbitrarily and branch into two subproblems. In one subproblem $c_1$ is removed
and in the other subproblem $c_2$ is removed. After recursively computing an agreement forest, the
removed leaf is added as an isolated vertex. This step is correct since, by the proof of Theorem 2.2,
in any maximum agreement forest at least one of $c_1$ and $c_2$ is an isolated vertex. For Cases 1
and 2, instead of making four cuts, we branch into four subproblems, one for each possible cut.
By the proof of Theorem 2.2, at least one of the four cuts is unavoidable, and hence at least one
subproblem has the same optimum as the original problem.

It remains to analyse the running time. In each step we branch into at most four subproblems.
For each subproblem we make one cut, and hence we reduce the parameter $k$ by one. Therefore,
at most $4^k$ subproblems are created. For each subproblem, we need only time polynomial in $n$.
This concludes the proof.
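
The counting argument can be written as a recurrence. Here $T(k)$ is notation introduced only for this remark: it denotes the maximum number of leaves of the search tree when the parameter is $k$. The geometric sum bounds the total number of nodes, internal ones included.

```latex
T(0) = 1, \qquad T(k) \le 4\,T(k-1) \quad (k \ge 1)
\quad\Longrightarrow\quad T(k) \le 4^{k},
\qquad
\sum_{i=0}^{k} 4^{i} = \frac{4^{k+1} - 1}{3} < \frac{4}{3} \cdot 4^{k} .
```

With polynomial work per node this gives the stated $O(4^k \mathrm{poly}(n))$ bound.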



5. Conclusions and open problems. We have given improved FPT and polynomial-time 
approximation algorithms for nonbinary MAF, and demonstrated that, as in the binary case, 
algorithms for MAF and DFVS can be combined to yield nontrivial approximation guarantees 
for nonbinary MAAF. A number of interesting open problems remain. Firstly, the best known 
polynomial-time approximation algorithms for binary MAF have a factor of 3, and for nonbinary 
this is now 4. Might it be that the binary and nonbinary variants are equally approximable, or is 
the nonbinary variant in some sense strictly more difficult to approximate? For nonbinary MAAF
we have shown how to achieve an approximation factor of $d(c+3)$, but for binary the corresponding
expression is $d(c+1)$; this gap also needs to be explored.

On the software side, we have implemented both our MAF algorithms and made them publicly 
available [20]. We can report the following provisional performance results. The FPT algorithm 
solves instances with k < 14 within a few minutes. The approximation algorithm solves instances 
with n < 500 within seconds (and possibly larger instances too). The highest approximation factor
encountered on the inputs we tried was 2.5 (which raises the question whether the 4-approximation
analysis given in this article can actually be sharpened). For nonbinary MAAF we have not yet 
implemented the d(c + 3) algorithm but are planning to do so. As in [19], it should be possible to 
obtain d = 1 by using Integer Linear Programming (ILP) to solve DFVS exactly. 